Adding in Fault Tolerance for External Storage Systems
When implementing
centralized storage solutions, you are often placing a large number of
very important eggs into a single basket. SAN and NAS manufacturers
understand this and have invested heavily in research and development
to build fault tolerance into their offerings. Many options are
available to the end user; some of the fault-tolerance options are as
follows:
RAID configurations— RAID levels 0+1 and 5 are most common. RAID level 6 offers the ability to lose two drives at a time and not lose data.
Triple mirroring—
This enables you to snap off a mirror so that data becomes static for
purposes of backup. Meanwhile, the system still has mirrored drives for
fault tolerance. This is most commonly used with databases.
Log shipping—
Most SAN and NAS devices can copy log files in near real time to
another SAN or NAS so that databases can be copied regularly and log
files can be kept in sync remotely.
Geographic mirroring—
SAN and NAS devices offer in-band and out-of-band options for mirroring
data across wide distances. Whereas SCSI has a 25-foot limitation,
Fibre Channel can locate a device up to 1,000km away.
Snapshotting—
By flagging disk blocks as being part of a particular version of a file
and writing changes to that file on new blocks, a NAS or SAN device can
take a snapshot of what the data looked like at a point in time. This
enables a user to roll back a file to a previous version. It also enables you to roll an entire system back to a point in time almost instantly.
Clustering—
NAS devices that use heads to serve data offer dual heads so that if
one fails, the other continues to serve data from the disks.
Redundant power systems—
Any good SAN or NAS offers multiple power supplies to protect against
failure. Take advantage of the separate power supplies by attaching them
to separate electrical circuits.
Redundant backplanes— Many NAS and SAN devices offer redundant backplanes to protect against hardware failure.
Hot standby drives—
By having unused drives available in the chassis, the device can
replace a failed disk instantly with one that is already present and
ready for use. Be sure to monitor the SAN or NAS device to see if a disk
has failed and been replaced. It can be easy to miss because there is
no interruption to service.
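The snapshotting item above describes writing changes to new blocks while a snapshot keeps pointing at the old ones. The following is a minimal sketch of that idea, not any vendor's implementation. The class name `SnapshotVolume` and its methods are hypothetical and exist only for illustration.

```python
class SnapshotVolume:
    """Toy volume that never overwrites a block in place (illustrative only)."""

    def __init__(self):
        self.store = []        # append-only pool of physical blocks
        self.active = {}       # logical block number -> index into store
        self.snapshots = {}    # snapshot name -> frozen copy of the block map

    def write(self, block_no, data):
        # Append the new data and repoint the live map; old data stays put.
        self.store.append(data)
        self.active[block_no] = len(self.store) - 1

    def read(self, block_no):
        return self.store[self.active[block_no]]

    def snapshot(self, name):
        # Only the small block map is copied; no data blocks are duplicated.
        self.snapshots[name] = dict(self.active)

    def rollback(self, name):
        # Rolling back swaps the block map, which is why it is nearly instant.
        self.active = dict(self.snapshots[name])


vol = SnapshotVolume()
vol.write(0, "mail v1")
vol.snapshot("before-change")
vol.write(0, "mail v2")
after_change = vol.read(0)
vol.rollback("before-change")
after_rollback = vol.read(0)
print(after_change, "->", after_rollback)  # prints: mail v2 -> mail v1
```

Because a rollback only replaces the block map, its cost does not depend on how much data the volume holds, which is consistent with the near-instant rollback described above.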
Although Exchange
2007 offers functions such as Cluster Continuous Replication to provide
for server-level fault tolerance, it is still a best practice to provide
disk-level redundancy for the individual servers. With the reduced
dependence on disk I/O in Exchange 2007 servers equipped with large
amounts of system memory, RAID 5 will become a more common configuration
on Mailbox servers. If the RAID controller uses a write cache, be sure
that the cache is protected by a battery backup. Failure to do so could
result in the loss of data held in the controller cache during a power
failure. If the cached writes are never committed to the disk,
the data will be in an inconsistent state and most likely will not be
usable.
Tip
RAID 5 is not recommended
for any application that performs write transactions more than about 30%
of the time. This is because each write requires reading from
multiple disks, recalculating the parity, and writing the new parity back.
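To see why the write mix matters, here is a rough sketch of the arithmetic. It assumes the commonly cited rule-of-thumb write penalties (four back-end I/Os per host write for RAID 5, two for mirrored sets) and a hypothetical disk group of eight disks at 150 IOPS each. None of these figures come from the text or from a specific array.

```python
# Rule-of-thumb back-end I/Os per host write (assumed values, not measured).
WRITE_PENALTY = {"RAID 0+1": 2, "RAID 5": 4}


def host_iops(raw_iops, write_fraction, penalty):
    """Host-visible IOPS a disk group can sustain for a given write mix."""
    read_fraction = 1.0 - write_fraction
    return raw_iops / (read_fraction + write_fraction * penalty)


RAW_IOPS = 8 * 150  # hypothetical: 8 disks at 150 IOPS each

for write_fraction in (0.1, 0.3, 0.5):
    raid5 = host_iops(RAW_IOPS, write_fraction, WRITE_PENALTY["RAID 5"])
    mirror = host_iops(RAW_IOPS, write_fraction, WRITE_PENALTY["RAID 0+1"])
    print(f"{write_fraction:.0%} writes: RAID 5 ~{raid5:.0f} IOPS, "
          f"RAID 0+1 ~{mirror:.0f} IOPS")
```

Under these assumptions, the gap between RAID 5 and a mirrored set widens as the write fraction grows, which is the effect the tip warns about.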
Recommendations for SAN and NAS Solutions
SAN and NAS manufacturers
have provided a number of technologies that make it easier to integrate
their products with specific software products. Because these products
have been available for a number of years, best practices around these
implementations have been developed and can help you avoid common
pitfalls with SAN and NAS usage.
Recommendations for Exchange with NAS/SAN Environments
When implementing a NAS or
SAN solution in a Microsoft Exchange environment, opinions differ on
the best way to implement it. Some of the recommended best practices
are as follows:
Run
multiple HBAs in each Exchange server with each HBA connected to a
different Fibre Channel switch. This allows for failover if one of the
Fibre Channel switches should fail.
Ensure
that zoning of the SAN is configured correctly so that only the
necessary systems can see the LUNs. In the case of a cluster, all nodes
that might potentially own the disks should be in the same zone. If an
unrelated Windows system sees the disks, it tries to write a new
signature to the disk, which makes it unreadable by the intended hosts.
Backups
should be performed at the storage group level rather than at the
mailbox level. Mailbox-level backups are very processor intensive for
the Exchange server.
If available, direct disk backup solutions are significantly faster than storage group-level backups.
If
you are implementing third-party applications with your NAS or SAN for
use with Exchange, make sure they are certified by Microsoft for use
with Exchange 2007 and that they use the standard application
programming interfaces (APIs) such as Volume Shadow Copy Services.
Separate
log files from databases onto different drive sets. This improves
overall throughput and improves recoverability in the case of a NAS/SAN
failure.
Replicate
databases hourly to another device for disaster recovery. Logs should
be replicated every few minutes. This limits potential mail loss to one
log replication interval.
Always
use integrated tools if they are available, such as Network Appliance’s
SnapManager for Exchange. They greatly simplify management and
recoverability of the product for which they were designed.
Always
plan for space reservation on a volume. If the database will grow to
80GB and will have snapshots taken for recoverability, reserve 160GB of
space on the device.
When
possible, expand capacity on the Exchange 2007 server via additional
mailbox databases placed on new LUNs. Although LUNs can be dynamically
grown, doing so is usually very time-consuming and can affect
system performance on the Exchange server.
Avoid
placing multiple virtual logical disks or LUNs on the same RAID group.
This could result in databases and log files being on the same RAID
group. This would complicate system recoveries if the RAID group were to
fail.
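The space reservation guideline in the list above can be worked through as a quick calculation. This sketch assumes the 2x factor implied by the 80GB to 160GB example applies to every database with snapshots. The database names and the second size are hypothetical.

```python
# Factor implied by the 80GB -> 160GB example; treat it as an assumption elsewhere.
SNAPSHOT_RESERVE_FACTOR = 2

# Hypothetical mailbox databases and their projected maximum sizes in GB.
projected_db_gb = {"MailboxDB1": 80, "MailboxDB2": 50}


def reserved_gb(projected_gb, snapshots=True):
    """Space to reserve on the volume for one database."""
    if snapshots:
        return projected_gb * SNAPSHOT_RESERVE_FACTOR
    return projected_gb


reservations = {name: reserved_gb(size) for name, size in projected_db_gb.items()}
total_gb = sum(reservations.values())
print(reservations, total_gb)
```

The actual snapshot overhead depends on the change rate and on how long snapshots are retained, so the factor is best treated as a planning starting point rather than a fixed rule.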
Consolidating the Number of Exchange Servers via NAS or SAN
Exchange servers were
traditionally sized based not only on performance potential, but also on
the time needed to recover a system. Administrators knew that if they
had a 4-hour SLA for system recovery, they could count on using half that
time to recover data from tape and half to perform the recovery tasks. This meant that
they could only have as much local storage as they could recover in 2
hours. So, if a backup/restore system could restore 16GB of data in 2
hours and each user was allowed 100MB of storage, the maximum number of
users on the system would be 160. For a company of 1,600 users, this
would mean 10 Exchange servers would be required to support the 4-hour
SLA.
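The sizing arithmetic above can be reproduced with a short sketch. The function names are hypothetical. The inputs are the figures from the example, with the restore rate expressed as 8GB per hour (16GB in 2 hours) and 1GB treated as 1,000MB, as the example does.

```python
import math


def max_users_per_server(sla_hours, restore_gb_per_hour, quota_mb):
    """Users one server can hold if half the SLA is spent restoring from tape."""
    restore_window_hours = sla_hours / 2
    restorable_gb = restore_window_hours * restore_gb_per_hour
    # The example treats 1GB as 1,000MB (16GB / 100MB = 160 users).
    return int(restorable_gb * 1000 // quota_mb)


def servers_required(total_users, users_per_server):
    """Round up, because a partially filled server is still a server."""
    return math.ceil(total_users / users_per_server)


users = max_users_per_server(sla_hours=4, restore_gb_per_hour=8, quota_mb=100)
servers = servers_required(total_users=1600, users_per_server=users)
print(users, servers)  # prints: 160 10
```

Changing the restore rate or the mailbox quota shows how directly the server count depends on restore speed when recovery comes from tape.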
By placing the mailbox
stores onto a NAS or SAN device that can be mirrored and snapshotted,
the recovery time for a 16GB database would drop to mere minutes.
Now the bottleneck would become the performance of the server itself
and possibly the I/O rate of the NAS or SAN. Odds are that the systems
that had been purchased for the ability to support 160 users would be
dual-processor systems with 1 to 2GB of memory. By reducing the server
count to two and fully populating those two systems with memory taken
from the retired systems, the two systems with NAS- or SAN-based
mailboxes could easily support 800 users each and still meet the
4-hour recovery time required by the SLA. This would eliminate
eight Exchange servers, which would free up OS licenses and
hardware as well as reduce the effort required to manage the data
center.
Tip
When consolidating
Exchange servers, consider using some of the newly freed-up servers
as Cluster Continuous Replication Exchange servers, or
place them in the lab for recovery and patch testing.